In this project, I use exploratory data analysis techniques in R to explore the Red Wine Quality dataset. The accompanying Wine Quality description file explains the variables, their meanings, and how the data was collected.
First, let's take a look at the dataset.
## Observations: 1,599
## Variables: 13
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Observations: The output lists 13 columns and 1,599 observations. One of the columns, X, is only a row index, which leaves 12 variables: 11 numeric input features and quality. Quality is stored as an integer but is best treated as an ordered factor. It ranges from 3 to 8, with a median of 6.
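A minimal sketch of how the data could be prepared for the rest of the analysis, assuming base R only. The five rows are copied from the structure output above; the file name in the comment and the helper `prepare_wine` are hypothetical and not taken from the original analysis.

```r
# First five rows of three columns, copied from the structure output above.
# With the full file one would call something like
# read.csv("wineQualityReds.csv") instead; that file name is an assumption.
wine <- data.frame(
  X = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Hypothetical helper: drop the row index X and add quality as an ordered factor
# whose levels cover the observed range of scores (3 to 8).
prepare_wine <- function(df) {
  df$X <- NULL
  df$quality.factor <- factor(df$quality, levels = 3:8, ordered = TRUE)
  df
}

wine_clean <- prepare_wine(wine)
str(wine_clean)      # column types and first values
summary(wine_clean)  # quartiles and mean for numeric columns, counts for the factor
```

Keeping the integer `quality` column next to the ordered factor is a design choice: the integer is convenient for correlations, while the factor is convenient for grouping and plotting.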
In this part, I show a univariate analysis of each variable, with plots.
quality:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
alcohol:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
sulphates:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
pH:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
density:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
total.sulfur.dioxide:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
free.sulfur.dioxide:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
chlorides:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
residual.sugar:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
citric.acid:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
volatile.acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
fixed.acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
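Each univariate summary above pairs a numeric summary with a plot. A rough sketch of that pattern in base R follows. The original plots may well have been drawn with another package, and `describe_variable` is a hypothetical helper. The values are the first eight alcohol and quality entries from the structure output, so the numbers printed here will not match the full-data summaries.

```r
# First eight values of alcohol and quality, copied from the structure output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L)

# Hypothetical helper: print the numeric summary, draw a histogram,
# and return the summary invisibly so it can be reused.
describe_variable <- function(x, label, ...) {
  s <- summary(x)
  print(s)
  hist(x, main = paste("Distribution of", label), xlab = label, ...)
  invisible(s)
}

alcohol_summary <- describe_variable(alcohol, "alcohol")

# quality only takes integer scores, so a bar chart of counts is used
# here instead of a histogram.
quality_counts <- table(quality)
barplot(quality_counts, main = "Wines per quality score",
        xlab = "quality", ylab = "count")
```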
There are 1,599 observations. Each observation is described by 11 features: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol.
The main feature of interest in the dataset is quality.
I expect alcohol content, pH, and the acidity variables (volatile.acidity, fixed.acidity, citric.acid) to have the most influence on quality.
I did not create any new variables in the dataset.
I found no unusual distributions and no missing attribute values. The dataset is already tidy, so there was no need to change the form of the data.
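The claim about missing values can be checked with a couple of base R calls. This is a sketch on five rows copied from the structure output; on the full data frame the same calls would apply unchanged.

```r
# Five rows of three columns, copied from the structure output.
wine <- data.frame(
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Number of missing values in each column; all zeros means nothing is missing.
missing_per_column <- colSums(is.na(wine))
print(missing_per_column)

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(wine))
print(complete_rows)
```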
Wines with higher citric acid tend to have higher quality ratings.
Residual sugar appears to have little impact on the quality of red wines.
pH appears to have little impact on the quality of red wines.
Alcohol content appears to have a strong impact on the quality of red wines.
Sulphates appear to have a strong impact on the quality of red wines.
Quality tends to increase as fixed acidity increases.
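One way to back up statements like these is to compare a feature across quality scores. The sketch below uses base R `aggregate` and `boxplot` on the first eight rows from the structure output; it is an illustration of the approach, not the code behind the original plots.

```r
# First eight rows of three columns, copied from the structure output.
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L)
)

# Median of each feature within each quality score.
by_quality <- aggregate(cbind(alcohol, residual.sugar) ~ quality,
                        data = wine, FUN = median)
print(by_quality)

# One box per quality score shows how the distribution of alcohol shifts.
boxplot(alcohol ~ quality, data = wine,
        xlab = "quality", ylab = "alcohol",
        main = "Alcohol by quality score")
```

A feature whose medians climb steadily with the quality score (as alcohol is reported to do) has a visible impact; one whose medians stay flat (as reported for residual sugar and pH) does not.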
Residual sugar and density increase together.
Density decreases as alcohol content increases.
The correlation coefficient between quality and each of citric.acid, fixed.acidity, sulphates, and alcohol is positive. It is also positive between residual sugar and density.
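The correlation coefficients mentioned here can be computed with `cor`. The sketch below assumes Pearson correlation (the original does not say which method was used) and runs on five rows copied from the structure output, so the printed values only illustrate the mechanics and will differ from the full-data coefficients. `cor_with` is a hypothetical helper.

```r
# Five rows of seven columns, copied from the structure output.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wine, method = "pearson")

# Hypothetical helper: correlations of every other column with one target,
# ordered from strongest to weakest by absolute value.
cor_with <- function(m, target) {
  r <- m[target, colnames(m) != target]
  r[order(abs(r), decreasing = TRUE)]
}

print(round(cor_with(cor_matrix, "quality"), 2))
print(round(cor_matrix["density", c("alcohol", "residual.sugar")], 2))
```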
It is interesting to see how density relates to both alcohol and sugar content.

### What was the strongest relationship you found?

The strongest relationship I found in this dataset was between quality and alcohol content.

# Multivariate Plots Section
## corrplot 0.84 loaded
Quality appears to be related to pH: wines with a pH between 3.2 and 3.6 tended to have better quality.
High-quality wines contain high levels of sulphates and alcohol.
High-quality wines contain high levels of citric acid and alcohol.
High-quality wines combine high alcohol with low volatile acidity.
- We have seen how alcohol and volatile acidity relate to quality.
- High-quality wines combine high alcohol with low volatile acidity; this combination yields the best quality.
- High-quality wines contain high levels of alcohol and citric acid.
- High levels of sulphates and alcohol are also associated with the best wines.
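The combined effect of two features on quality can be sketched by splitting each feature at its median and comparing mean quality across the resulting cells. This is an illustrative approach under that assumption, not the method behind the original multivariate plots; the column names `alcohol.level` and `acidity.level` are hypothetical, and the six rows are copied from the structure output.

```r
# First six rows of three columns, copied from the structure output.
wine <- data.frame(
  volatile.acidity = c(0.700, 0.880, 0.760, 0.280, 0.700, 0.660),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Label each wine as "high" or "low" relative to the median of each feature.
wine$alcohol.level <- ifelse(wine$alcohol > median(wine$alcohol), "high", "low")
wine$acidity.level <- ifelse(wine$volatile.acidity > median(wine$volatile.acidity),
                             "high", "low")

# Mean quality in each alcohol x volatile acidity cell (NA for empty cells).
mean_quality <- tapply(
  wine$quality,
  list(alcohol = wine$alcohol.level, volatile.acidity = wine$acidity.level),
  mean
)
print(mean_quality)

# Scatterplot of the two features, with one colour per quality score.
plot(volatile.acidity ~ alcohol, data = wine,
     col = as.integer(factor(wine$quality)), pch = 19,
     main = "Volatile acidity vs alcohol, coloured by quality")
```

In this small sample the only wine with a quality above 5 falls in the high-alcohol, low-volatile-acidity cell, which is the pattern the list above describes.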
From the scatterplot, we can see that quality increases as alcohol content increases. The two have a strong, positive relationship.
As the histogram shows, wine quality increases as sulphates increase.
As fixed acidity decreases, pH increases: the relationship between the two is negative.
The red wine dataset contains information on 1,599 red wines. Using R, I tried to get a sense of which factors might affect the quality of a wine. The main finding is that high-quality wines combine high alcohol with low volatile acidity. The relationships of alcohol and citric acid with quality were positive and strong. The most interesting observation in this dataset was the relationship of alcohol and sugar with the density of the wines.